The National Science Foundation's EarthCube End User Workshop was held at USC Wrigley 
Marine Science Center on Catalina Island, California in August 2013. The workshop was de- 
signed to explore and characterize the needs and tools available to the community that is fo- 
cusing on microbial and physical oceanography research with a particular emphasis on 'omic 
research. The assembled researchers outlined the existing concerns regarding the vast data
resources that are being generated, and how we will deal with these resources as their volume
and diversity increase. Particular attention was focused on the tools for handling and
analyzing the existing data, on the need for the construction and curation of diverse federated 
databases, as well as development of shared, interoperable, "big-data capable" analytical 
tools. The key outputs from this workshop include (i) critical scientific challenges and
cyberinfrastructure constraints, (ii) the current and future ocean 'omics science grand challenges
and questions, and (iii) the data management, analytical and associated cyberinfrastructure
capabilities required to meet critical current and future scientific challenges. The main thrust
of the meeting and the outcome of this report is a definition of the 'omics tools, technologies
and infrastructures that facilitate continued advance in ocean science biology, marine
biogeochemistry, and biological oceanography.



Introduction 

A large group of ocean scientists and oceanographers
are now employing 'omics approaches
to characterize and quantify the nature, distribution
and function of organisms in ocean ecosystems
[1-3]. 'Omics is defined here as the collective
molecular or biochemical characterization of
pools of biological molecules, such as genes and
genomes, transcripts and transcriptomes, proteins
and proteomes, and small molecules, metabolites
and metabolomes, that together encode the structure
and function of an organism or organisms
and can be used to explore their dynamics and
flexibilities. The tools and datasets that encompass
'omics science are diverse, complex, and rapidly
expanding, and require the construction,
curation, and query of diverse federated databases,
as well as the development of shared, interoperable,
"big-data capable" analytical tools. Given the
trajectory of "next generation" sequencing technologies,
economics, and applications, this arena
represents a major "big data challenge" for the
ocean science community at large.

To discuss the 'omic data challenges for ocean scientists,
an NSF EarthCube end user workshop was
held at the USC Wrigley Marine Science Center on
Catalina Island, California in August 2013. The
meeting brought together a group of scientists
with experience in ocean science, environmental
genomics and allied sciences, biological oceanography,
bioinformatics and computer science, as
well as NSF and private foundation program
managers. A main goal of the Ocean 'Omics NSF
EarthCube end user workshop was to help identify
and prioritize a set of scientific drivers and
cyberinfrastructure requirements necessary to
enable the storage, curation, federation, and comparative
analyses of the large and small scale ocean
science genomic, metagenomic, metatranscriptomic
and metaproteomic datasets
that are rapidly accumulating. Although the collection,
availability and analyses of these and similar
datasets are improving our understanding of ecosystem
processes and our ability to predict their future
trajectories, the computational and analytical
tools and infrastructures needed to manage, share,
analyze and visualize them need accelerated development
and expansion. Workshop participants
discussed these current challenges, and identified
specific tools, technologies and infrastructures
that will be required to continue advancing 'omics
applications in ocean science biology, marine
biogeochemistry, marine biology, and biological
oceanography in the 21st century.

Background and purpose of meeting 

The NSF EarthCube initiative was launched in June 
2011 to seek "transformative concepts and ap- 
proaches to create integrated data management 
infrastructures across the Geosciences." NSF and a 
community of U.S. geoscientists and 
cyberscientists have recognized that "for 
EarthCube to achieve its potential as a new data 
and knowledge management system for the 21st 
Century, the collective needs and desires of geo- 
scientists across the disciplines must be made 
known so similarities and difference between user 
groups and disciplines can be identified and ad- 
dressed." To this end, the NSF Geosciences Direc- 
torate solicited proposals to conduct domain 
workshops "designed to listen to the needs of the 
end-user groups that make up the geosciences and 
associated research groups and to understand bet- 
ter how data-enabled science can help them 
achieve their scientific goals." 



The overall purpose of the August 2013 Catalina
end user workshop was to develop and articulate
a set of unifying scientific and computational requirements
shared by ocean 'omic scientists. Participants
were challenged to envision new ways to
integrate the community's data collection, archiving
and analyses, and scientific efforts, from the
perspectives of both domain-specific ocean scientists
and computer scientists. The workshop
participants discussed the existing
suite of tools and technologies available to perform
large scale 'omics experiments and analytics,
identified gaps in existing infrastructures,
and attempted to forecast potential future directions
for these fields.

Specific goals of the Ocean 'Omics Workshop: 

1. Identify the critical scientific challenges 
and cyberinfrastructure constraints for 
ocean 'omic science. 

2. Develop a set of relevant ocean 'omic science
use-cases that identify and combine
compelling science drivers with explicit
cyberinfrastructure needs.

3. Identify the data management, analytical 
and associated cyberinfrastructure capa- 
bilities required to address the critical 
ocean 'omic scientific challenges, both cur- 
rent and future. 

Participants 

The participants (see Participant List) were invited
based on: 1) their scientific and technical experience
and interest in the scientific questions and challenges
in the context of ocean 'omics science; and
2) their knowledge of cyberinfrastructure technologies,
applications, and current capabilities, in
the context of ocean 'omics science and 'omics in
general.

Outputs and Conclusions 

I. Critical scientific challenges and
cyberinfrastructure constraints

There are many challenges that a community must 
face if it is to design and implement high impact 
interdisciplinary science. Primary among these is 
communication, with the need to develop a com- 
mon language to minimize misunderstanding and 
misinterpretation when discussing project design, 
implementation and analyses. Currently, there exist
a number of different databases for exploring
metagenomic, other 'omic, and environmental datasets
in the context of ocean science [4-7].
However, a common language to facilitate communication
must be built on a series of standardization
efforts. The internet is a prime example of
this, whereby all computers use standard languages
to facilitate exquisitely integrated interactions
across the world, enabling communication
between myriad disciplines. However, it is still a
challenge for any community to develop, validate 
and implement standardized and federated pro- 
cedures for sample collection schemes, sample 
QC/QA, data formats, annotation workflows, and 
data analyses. Even more complex is the task of in- 
tegrating those with geochemical, biological and 
physical oceanographic data over multiple nested 
spatiotemporal scales, to allow researchers from 
different scientific disciplines to interact and actually
use the data being generated. Grassroots efforts
such as the Genomic Standards Consortium
[8] have overseen the development of standard
formats and languages for describing how sequencing
data was generated and for capturing
the contextual environmental data (physical,
chemical and biological data streams) in a common,
machine-readable format. These efforts are
widely perceived as facilitating data sharing and
data re-use, by limiting the need for detailed literature
searches, and enabling meta-analyses of existing
data resources (in this case genomics) for
generating novel high-impact science. However,
these efforts are still limited in their scope and, despite
considerable work and integration with public
databases for sequence data (e.g. INSDC,
MG-RAST, IMG/M, CAMERA, etc.), uptake and incorporation
by the community takes time and is
currently still limited. There are a number of reasons
for the slow adoption of community-wide
standards and practices, briefly explored below.
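
As a rough illustration of what "a common, machine-readable format" for
contextual data can mean in practice, the following Python sketch checks a
sample record against a small checklist before serializing it to JSON. The
field names, the checklist itself and the example values are assumptions made
for this illustration; they are not the Genomic Standards Consortium's actual
specification.

```python
import json
from datetime import date

# Hypothetical minimal checklist (illustrative only, NOT an official standard).
REQUIRED_FIELDS = {
    "sample_id": str,
    "collection_date": str,      # ISO 8601 date, e.g. "2013-08-21"
    "latitude": float,           # decimal degrees, north positive
    "longitude": float,          # decimal degrees, east positive
    "depth_m": float,            # sampling depth in metres
    "sequencing_platform": str,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    if problems:
        return problems  # range checks only make sense on a complete record

    try:
        date.fromisoformat(record["collection_date"])
    except ValueError:
        problems.append("collection_date is not an ISO 8601 date")
    if not -90.0 <= record["latitude"] <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= record["longitude"] <= 180.0:
        problems.append("longitude out of range")
    if record["depth_m"] < 0:
        problems.append("depth_m must be non-negative")
    return problems


def to_json(record):
    """Serialize a validated record with sorted keys for stable diffs."""
    return json.dumps(record, sort_keys=True)


# Made-up example values, used only to exercise the functions above.
example_record = {
    "sample_id": "demo-001",
    "collection_date": "2013-08-21",
    "latitude": 33.445,
    "longitude": -118.485,
    "depth_m": 5.0,
    "sequencing_platform": "unspecified",
}
```

A record that passes such a check can be shared or archived without a reader
having to recover the sampling context from the literature, which is the
benefit the standardization efforts described above are aiming for.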

A primary concern raised in the workshop was
the lack of access that the community has to data
storage space and transfer mechanisms for the
sharing and archiving of raw data, processed data,
data products from workflows, and records of the
provenance of data analyses. This concern is compounded
by the limited access to the large scale, high
performance compute capabilities necessary for
the annotation, comparison, statistical analyses
and other workflows required for analyses of
large scale ocean 'omic datasets. Even with common
languages to describe and share sequence data,
which could aid interaction in the absence of any
technical impediment, there is a continued need
to develop these standards as new sequence
types and non-sequence-based data types
(e.g. mass spectrometry used in proteomics and
metabolomics) emerge; these too will need to be
stored, accessed, analyzed and federated with
other environmental and 'omic data streams.

Currently, the community also lacks sufficient 
tools for analysis and simultaneous visualization 
and inter-comparison of heterogeneous data types 
(e.g., environmental, 'omic and oceanographic da- 
tasets). This concern is also a primary factor limit- 
ing the integration of emerging 'omics datasets 
and analyses with existing and developing physi- 
cal and biogeochemical models. This is partly an 
analytical problem (e.g., the mapping of genes and 
pathways onto their respective biogeochemical 
activities), and partly an integration problem, re- 
quiring the combination of quantitative 'omics- 
derived biogeochemical information, with quanti- 
tative geophysical and geochemical models. The 
development of better analytical and visualization 
tools, and modeling platforms to capture transla- 
tion knowledge must come from the community, 
and be driven by community need so as to ensure
that these products are both relevant and up-to-date.
However, focus and funding for developing
these tools must still come from the agencies,
since the 'cool tools' that we take for granted
(iPhone apps, Facebook, professional software
platforms, etc.) will always have a shelf life, and
lack the interfaces that enable researchers to
overcome technical education barriers to use. Facilitating
the development of both the software
tools that improve analysis and visualization of
ocean 'omic datasets and the platforms that facilitate
integrated modeling of diverse data
streams is essential if we are to fully capitalize on
existing investment in current research. However,
this will also take both innovation and sustained 
investment, along with a certain degree of com- 
munity consensus on the existing tool infrastruc- 
ture that is required to 'do the job right'. A related 
issue is the efficient distribution and dissemina- 
tion of bioinformatics tools. Often these tools are 
developed in individual laboratories without intu- 
itive user interfaces and in formats or with de- 
pendencies on other software that hinder their 
utilization by the broader community. Develop- 
ment of procedures, best practices, and infrastruc- 
ture to facilitate the dissemination of such tools is 
required to capture and coordinate community- 
driven advances in analytical capabilities. 

The majority of our community is dispersed
through academic and federal labs that differ vastly
with regard to institutional resources for empowering
large scale computing. Major advances
for elucidating meaningful interpretations of 
'omics data will require investments in computing 
and informatics infrastructure that can be utilized 
and adapted by users regardless of institutional 
access. If resources don't become available across 
the community, we will have institutional winners 
and losers, whereby the scientific home of a re- 
searcher or student will largely dictate their abil- 
ity to work with 'omic scale data. 

II. Ocean 'omics science challenges 
and questions: current and future 

The rapidly increasing throughput and declining
costs of producing 'omics data offer new opportunities
to address pressing issues in ocean sciences.
Several high-priority science questions
were identified that hold promise for significant
advances through application of 'omic approaches
and that will likely be the focus of interdisciplinary
efforts during the next 5-15 years. These
science questions and challenges were identified
as promising use case scenarios that combine
compelling science drivers with explicit
cyberinfrastructure needs.

Science Question and Challenge # 1 

"How do biological population structure and func- 
tion co-vary with physical and chemical oceano- 
graphic parameters within and between different 
oceanographic provinces?" The physical and chem- 
ical environment shapes the structure and func- 
tion of marine microbial communities, and micro- 
bial communities in turn influence the chemistry 
of the seas. Over the past five years, it has become 
possible to deeply characterize diverse microbial 
communities at the genomic level and to track the 
expression of numerous genomes across space 
and time. At least from a data acquisition stand- 
point, we are now poised to address questions 
such as: 

• How do steep physical and chemical gradi- 
ents result in steep microbial functional 
gradients and drive changes in microbial 
biodiversity? 

• How do microbial communities in the 
ocean fluctuate across key boundaries and 
gradients, such as distance from land, sea- 
floor spreading centers, gyres, and 
upwelling zones? 



• How do microbial communities change as 
a function of geochemistry, currents, and 
crustal age? 

• How do microbial community dynamics
affect the flux of matter and energy
throughout the ocean's water column, benthos
and subsurface?

One of the greater challenges in addressing the
above questions is to rapidly generate, analyze,
annotate and make publicly accessible the rapidly
accumulating new, large scale 'omic datasets and
metadata. Another choke point is the availability
of genomic data for key organisms, which is generally
limited to what has been published in GenBank.
As such, researchers wishing to map their
transcriptomic data against available genomes
will be limited to what is available at any given
time. Furthermore, the cycle time and computer
resources available for analyses are also limited.
Publishing further resources in the public domain,
and placing these data resources in cloud
computing infrastructure (for both storage and
analytical purposes), will greatly facilitate answering
these questions.

Science Question and Challenge # 2 

"What are the underlying molecular and biochemi- 
cal mechanisms that regulate the physiological re- 
sponses of microbes to environmental change, and 
their downstream biogeochemical consequences 
and feedbacks?" The capacity to deeply track the 
content and expression of microbial genomes 
across space and time provides windows into the 
genetic responses of microbes to environmental 
change. Such dynamics can be observed both in 
the laboratory and in the field. In the next 5-10 
years, as ocean 'omics datasets continue to grow 
in temporal and spatial coverage, there will be in- 
creasing and emergent opportunities for meta- 
analyses that characterize responses of microbes 
to environmental perturbation. One can now envi- 
sion 'omics data resolving longer-term microbial 
responses, such as dynamics on decadal time 
scales, in much the same way that large-scale 
physical and chemical data currently provide pictures
of climate change. In some cases these insights
may uncover well-known organisms, pathways,
or genes, while in other cases an observational
approach may highlight unknown players
(organisms, pathways, or genes) as key responders
to perturbation and mediators of feedbacks.
Hence, if the data is effectively preserved and archived,
'omic datasets could represent powerful
means of discovery and hypothesis generation.

Standards in Genomic Sciences
Gilbert et al.

Central science questions here include: 

• What are the underlying molecular and bi- 
ochemical mechanisms that regulate the 
physiological responses of microbes to en- 
vironmental change, and their down- 
stream biogeochemical consequences and 
feedbacks? 

• How does 'omic and population plasticity 
in microbes bolster ecosystem resilience 
to disturbances? 

• How do global change and environmental 
disturbance impact genomic repertoires, 
transcriptional organization, protein and 
metabolome content, and biogeochemical 
activity? 

• Which microbial taxa and processes are af- 
fected by rapid polar climate change, and 
how do those taxa impact the budget of 
greenhouse gases, permafrost thawing and 
dissolved organic carbon release and 
transport in time and space? 

Science Question and Challenge # 3 

"Can 'omics data be used to describe and model eco- 
system processes and their trajectories?" To date,
'omics information has largely been utilized to uncover
specific populations that underpin key processes,
hence deepening our understanding of microbial
communities and the ecosystem processes
they mediate. A major opportunity (and challenge)
for the future is to better interpret this information
so that it can be leveraged to predict future
trajectories of large, microbially-mediated ecosystem
processes. For example, accurate mapping of
microbial genes and gene products onto the cognate
biogeochemical cycles they catalyze could
enable further modeling based on gene distributions.
Such gene to biogeochemical reaction associations
have the potential to link microorganisms to
their activities in specific environmental settings.
Such distributions can be used to generate hypotheses
about the nature of biogeochemical
feedback loops, and their possible variability under
different scenarios of climate and biogeochemical
change. 'Omics data is valuable both for
the parameterization of models (e.g., defining the
range of different microbial functional groups and
traits that would be useful to simulate) and
for the validation and tuning of models by comparing
model outputs to 'omics observations and
biogeochemical process measurements.
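
The gene-to-process mapping described above can be sketched in a few lines of
Python. The sketch below sums per-gene read counts into process-level
fractions that a model could, in principle, consume as parameters or compare
against for validation. It is a deliberately simplified illustration: the
lookup table lists a handful of widely used marker genes, the function name
and input layout are assumptions made here, and real marker genes are not
one-to-one indicators (several also occur in organisms that run the reverse
or a related reaction).

```python
from collections import defaultdict

# Marker gene -> process it is commonly used to indicate (simplified).
GENE_TO_PROCESS = {
    "nifH": "nitrogen fixation",
    "amoA": "ammonia oxidation",
    "nosZ": "nitrous oxide reduction",
    "dsrA": "sulfate reduction",
    "mcrA": "methanogenesis",
}


def process_profile(gene_counts):
    """Aggregate gene counts into per-process fractions of the total.

    gene_counts maps a gene name to a read (or copy) count for one sample.
    Returns (fractions_by_process, unmapped_fraction).
    """
    totals = defaultdict(float)
    unmapped = 0.0
    for gene, count in gene_counts.items():
        process = GENE_TO_PROCESS.get(gene)
        if process is None:
            unmapped += count          # keep track of what the table misses
        else:
            totals[process] += count

    grand_total = sum(totals.values()) + unmapped
    if grand_total == 0:
        return {}, 0.0
    fractions = {p: c / grand_total for p, c in totals.items()}
    return fractions, unmapped / grand_total
```

Reporting the unmapped fraction alongside the profile matters here, since the
text notes that observational approaches may highlight unknown genes as key
players; a mapping that silently drops them would hide that signal.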

Although there are still many barriers to surmount,
it is now possible to imagine the development
of integrated 'omic-biogeochemical-ecological
models that could be utilized by stakeholders
and regulators for the effective management
and monitoring of water and ecosystem resources
such as fisheries. One of the most obstructive
barriers is access to the multiple data types (environmental
data, time series data, organismal distributions
and their variability, process measurements,
'omics datasets, etc.) that are needed to
drive predictions. Researchers require access not
only to 'omics data but also to biogeochemical, physical,
and remote sensing data. These data types are
often generated by specialists and the formats are
not interchangeable, driving the need for more
cross talk among different disciplines. Underlying
science questions here include:

• How can 'omics data be more effectively 
leveraged into predictive frameworks for 
understanding ecosystem processes and 
their future trajectories? 

• How can 'omics data be better interpreted 
and analyzed using graphical outputs, 
models and indicators, that would be use- 
ful to managers and stakeholders for effi- 
ciently monitoring ecosystem changes and 
their consequences? 

III. Data management, analytical and
associated cyberinfrastructure
capabilities required to meet critical
scientific challenges, current and future

The attendees of the workshop were broadly
representative of the community of users
and developers; as such, these tool recommendations
stem largely from individual experience
across a continuum of disciplinary expertise.

In the context of the science questions and use 
cases discussed above, a number of requirements 
and needs for cyberinfrastructure can be identi- 
fied. Five categories were identified as being of 
immediate importance to improve the archiving of 
and access to data resources, their analyses, ex- 
ploration, and visualization, and their integration 
between microbial genomics, zoology, oceanography,
biogeochemistry and other overlapping disciplines:

1. Development of integrated 'omics databases
is required to enable curation, maintenance
and data standardization, to facilitate
primary data submission, extraction
of raw and processed data, and intelligent
query of data resources. Achieving this
will require tools for rapid and simple data
query and metadata association. While
such tools do exist, they do not meet the
community's needs, in part because
they were developed without community-wide
consultation.
Building community consensus is an arduous
and complicated process, with its own
downsides. Integration and tool development
should incorporate non-sequence-based
datasets (e.g. metabolomics and
lipidomics) into existing and emerging oceanographic
'omics database, analysis and visualization
platforms. Environmental
'omics databases need to be:

(a) federated (i.e., all datasets can be
interoperably queried and are transparently
accessible)

(b) curated (validated and updated, as for
example NCBI RefSeq datasets)

(c) sustained (i.e. a five-year commitment
of support will not provide sustainable
infrastructure), and importantly

(d) intuitively accessible to a broad range
of scientists, and the public

2. The ocean 'omics community would bene- 
fit from "Google-like" or "Kayak-like" 
search and suggestion functions and en- 
gines, that could query across complex and 
heterogeneous, federated environmental, 
oceanographic and 'omic databases. How- 
ever, as highlighted above this will require 
significant and sustained investment and 
development. 

3. Tools and procedures are required for access
to high performance computing and
statistical analyses of large scale 'omic datasets,
that could accommodate both naive
users and experienced "power users".
One possibility is a user facility that
functions similarly to the UNOLS oceanographic
facilities, that would provide access
to software developers,
bioinformaticians, and analytical tools, as
well as the hardware (storage facilities,
servers, clouds, etc.) required for 'omic
analyses. Researchers could request access
to this facility in association with successful
grant applications, as with UNOLS. Extending
the capabilities of BCO-DMO or
similar services is an alternative approach.
This framework could also be an efficient
means of connecting biologists and oceanographers
to bioinformaticians for the
purpose of tool development, perhaps
through a special streamlined application
process such as those used at national laboratories
(e.g., synchrotron sources).

4. Tools are required for more intuitive, accessible
and integrated visualization of
linked environmental, 'omic and oceanographic
(and other interdisciplinary) datasets.
Statistical tools and techniques for
dataset inter-comparison and spatiotemporal
modeling are also critical and need
considerable development to manage the
scope and scale of both existing and future
datasets.

5. The community would benefit from access
to a web clearing-house/portal with links
to standard "ocean 'omics" best practices,
algorithms, software, tutorials, forums,
and workflows, as well as analytical and
statistical methods under development,
with entry points for both naive and power
users. Such a resource could also
facilitate and incentivize the effective dissemination,
maintenance, and improvement
of bioinformatic tools.
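
The "federated" requirement in item 1 above (all datasets interoperably
queried through one entry point) can be illustrated with a minimal Python
sketch. Everything named here is hypothetical: the `Repository` interface,
the `InMemoryRepository` stand-in and the depth-range query are assumptions
chosen to keep the example self-contained, and no existing database exposes
this API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple


@dataclass(frozen=True)
class SampleHit:
    source: str      # which repository returned the hit
    sample_id: str
    depth_m: float


class Repository(Protocol):
    """Hypothetical common interface each participating database implements."""
    name: str

    def search(self, min_depth_m: float, max_depth_m: float) -> Iterable[SampleHit]:
        ...


class InMemoryRepository:
    """Stand-in for a real database; holds (sample_id, depth_m) rows in memory."""

    def __init__(self, name: str, rows: Iterable[Tuple[str, float]]):
        self.name = name
        self._rows = list(rows)

    def search(self, min_depth_m: float, max_depth_m: float) -> Iterable[SampleHit]:
        for sample_id, depth_m in self._rows:
            if min_depth_m <= depth_m <= max_depth_m:
                yield SampleHit(self.name, sample_id, depth_m)


def federated_search(repositories: Iterable[Repository],
                     min_depth_m: float,
                     max_depth_m: float) -> List[SampleHit]:
    """Send one query to every repository and merge the hits into one list."""
    hits: List[SampleHit] = []
    for repo in repositories:
        hits.extend(repo.search(min_depth_m, max_depth_m))
    # Deterministic ordering so results do not depend on repository order.
    return sorted(hits, key=lambda h: (h.depth_m, h.source, h.sample_id))
```

The point of the sketch is the shared interface: once each database answers
the same query in the same record format, a search engine of the kind
described in item 2 can be layered on top without knowing how any individual
repository stores its data.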

Ocean 'omics meeting recommendations:
next steps

The workshop attendees discussed some of the 
necessary first steps and enabling activities that 
will help move ocean 'omics science, technology
and education into the future. 

1. Cross train and educate computer scien- 
tists and engineers, and ocean and earth 
scientists to improve communication and 
collaboration among disciplines. This includes
training and education to develop
cross-disciplinary expertise within and be- 
tween bioinformatics, the Earth sciences, 
and the Ocean sciences. 

2. Facilitate access, availability and utilization
of NSF supercomputers for the Earth
and Ocean sciences communities. Using
government supercomputers should be as
technically easy and as feasible as accessing
the Amazon EC2 cloud, especially in regard
to requesting and accessing compute
cycles.

3. Plan and initiate a community Research 
Coordination Network to support 
cyberinfrastructure technology and infra- 
structure development and education in 
ocean 'omics. 

4. Promote the development of an EarthCube
system that would combine the facilitative
role of the BCO-DMO database (or similar)
with novel and flexible analytic and visualization
services for exploring ocean 'omics and
oceanographic data (e.g., Ocean Data View-like
software and tools for ocean 'omics
data).

5. Further identify ocean 'omics
cyberinfrastructure "parts" (e.g. dataset
curators, search engines, high performance
compute facilities, workflows, user analytical
facilities, developers, etc.) that are operational
and in use now, and determine
which ones might be further improved,
developed, federated, and networked into
a functional EarthCube community ocean
'omics cyberinfrastructure solution.